perf(shuffles): Incrementally retrieve metadata in reduce #3545
Conversation
CodSpeed Performance Report
Merging #3545 will degrade performance by 13.84%.
Summary
Benchmarks breakdown
Codecov Report
Attention: Patch coverage is
Additional details and impacted files
@@ Coverage Diff @@
## main #3545 +/- ##
==========================================
- Coverage 78.06% 77.17% -0.90%
==========================================
Files 728 728
Lines 89967 91917 +1950
==========================================
+ Hits 70236 70936 +700
- Misses 19731 20981 +1250
Interesting. One suggestion I have is to maybe see if we can use

I wonder if this might simplify the logic here, and it might also fix some interesting OOM issues I was observing, wrt the workers that are holding the metadata objectrefs dying due to OOM. Could we try that and see if it gives us the same effect? We have to figure out where it is appropriate to call this on the metadata objectrefs, though. I would look into
Another idea... Looks like these metadata objects are being retrieved using a Ray remote function:

My guess is that if the

I wonder if there's a better way of dealing with this.

Edit: I think this might only be triggered from certain codepaths, and in most cases the metadata is returned as an objectref after execution. This code is so messy...
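For context on the pattern being discussed (the actual snippet isn't preserved above), here is a minimal, hypothetical sketch of what retrieving partition metadata through a Ray remote function can look like. `get_metadata` and the list-based "partition" are made up for illustration and are not Daft's real code.

```python
import ray

ray.init(ignore_reinit_error=True)

@ray.remote
def get_metadata(partition):
    # Compute lightweight stats from a materialized partition (stand-in for
    # Daft's real partition metadata, e.g. row count and byte size).
    return {"num_rows": len(partition)}

# The partition lives in the object store; fetching its metadata this way means
# launching an extra remote task and blocking on ray.get for the result.
partition_ref = ray.put(list(range(1_000)))
metadata = ray.get(get_metadata.remote(partition_ref))
print(metadata)  # {'num_rows': 1000}
```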
Doing some experiments in #3557, but it seems like

Doing another round of tests now using an explicit
So this PR looks good to me, but I wonder if we should take a more holistic approach here and maybe just fix how metadata is being passed around the system rather than patch it here specifically for the map/reduce shuffle.
Maybe something really dumb like hooking into a stats/metadata actor? Would love to get thoughts from @samster25
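To make the "stats/metadata actor" idea concrete, here is a hedged sketch of what such a central cache could look like; `MetadataStore` and its methods are hypothetical and not part of Daft or Ray.

```python
import ray

@ray.remote
class MetadataStore:
    """Hypothetical central actor: tasks push partition metadata here on
    completion, and downstream stages read cached stats instead of pulling
    metadata objectrefs off worker heaps."""

    def __init__(self):
        self._metadata = {}

    def put(self, partition_id, metadata):
        self._metadata[partition_id] = metadata

    def get(self, partition_id):
        return self._metadata.get(partition_id)

    def total_rows(self):
        return sum(m.get("num_rows", 0) for m in self._metadata.values())

# Usage sketch:
# store = MetadataStore.remote()
# ray.get(store.put.remote("part-0", {"num_rows": 1_000}))
# print(ray.get(store.total_rows.remote()))
```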
Tried this out: essentially, always fetch metadata and cache it upon partition task completion, unless it is the final partition task. Tried this on a few shuffle configurations (1000 x 1000, 2000 x 2000), and it works pretty well, roughly 20s faster. This could be viable? I believe that as long as the task is not final, the partition metadata will need to be retrieved anyway for subsequent tasks, so we might as well retrieve it once it is complete. Since it is a blocking call, we want to pipeline it with running tasks, which should work if we retrieve it during the task awaiting state.
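A rough sketch of what "retrieve metadata during the task awaiting state" means, assuming a hypothetical driver-side loop; none of these names (`await_tasks`, `get_metadata`, `inflight`) come from the actual scheduler.

```python
import ray

@ray.remote
def get_metadata(partition):
    return {"num_rows": len(partition)}

def await_tasks(inflight, metadata_cache, final_task_ids):
    """inflight maps a partition ObjectRef to its task id. As each task
    completes, eagerly fetch and cache its metadata (unless it is a final
    task), so the blocking fetch overlaps with tasks still running on the
    cluster."""
    pending = list(inflight)
    while pending:
        done, pending = ray.wait(pending, num_returns=1)
        for partition_ref in done:
            task_id = inflight[partition_ref]
            if task_id not in final_task_ids:
                metadata_cache[task_id] = ray.get(get_metadata.remote(partition_ref))
            yield task_id, partition_ref
```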
Made a refactor to have the metadata caching logic on the scheduler instead of on the reduce op, PTAL!
Tested on TPCH SF 1000 on an 8 x i8.4xlarge node cluster.
Not much difference.
The idea sounds good to me, but wondering if we should just make it always retrieve metadata -- does that affect performance?
@@ -1771,7 +1771,7 @@ def __iter__(self) -> MaterializedPhysicalPlan:
         try:
             step = next(self.child_plan)
             if isinstance(step, PartitionTaskBuilder):
-                step = step.finalize_partition_task_single_output(stage_id=stage_id)
+                step = step.finalize_partition_task_single_output(stage_id=stage_id, cache_metadata_on_done=False)
How cheap is this, and should we just always do this perhaps?
I believe the metadata is also going to be fetched if a user ever tries to display/show a dataframe, since it needs to figure out how large the entire dataframe is using the total row count. If that's the case, maybe we just simplify the logic here and always force a fetch?
I think doing this on the final step may introduce some regressions if the result set of a .collect is very large and metadata is not required, i.e. if the dataframe is not displayed, or perhaps it is an intermediate collect step. If it is a show then it should be quite cheap, and this eager fetching probably won't help very much.
Incrementally retrieve partitions and metadata as fanouts are completed, instead of retrieving only after all are completed. This drastically speeds up large partition shuffles.
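As a hedged sketch of the before/after behavior (hypothetical names; `fanout_task` and `get_metadata` are stand-ins, not the actual shuffle code): previously the reduce side effectively blocked until every fanout finished before pulling metadata; now it pulls each fanout's metadata as soon as that fanout completes, overlapping the fetches with the remaining map work.

```python
import ray

@ray.remote
def fanout_task(i):
    # Stand-in for a map-side fanout producing one partition.
    return list(range(i * 10, (i + 1) * 10))

@ray.remote
def get_metadata(partition):
    return {"num_rows": len(partition)}

fanout_refs = [fanout_task.remote(i) for i in range(8)]

# Before (all-at-once): block until every fanout is done, then fetch metadata.
# ray.wait(fanout_refs, num_returns=len(fanout_refs))
# metadata = {r: ray.get(get_metadata.remote(r)) for r in fanout_refs}

# After (incremental): fetch each fanout's metadata as soon as it completes.
metadata = {}
pending = list(fanout_refs)
while pending:
    done, pending = ray.wait(pending, num_returns=1)
    for ref in done:
        metadata[ref] = ray.get(get_metadata.remote(ref))

print(f"retrieved metadata for {len(metadata)} fanouts incrementally")
```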
Before, we would see pauses between the map and reduce phases. (The example below is a 1000 x 1000 x 100MB shuffle.)
Now: